Abstract. This work studies the problem of 2-dimensional searching
for the 3-sided range query of the form [a,b] × (−∞,c] in both main
and external memory, by considering a variety of input distributions.
We present three sets of solutions each of which examines the 3-sided 
problem in both RAM and I/O model respectively. The presented data 
structures are deterministic and the expectation is with respect to the 
input distribution: 

(1) Under continuous μ-random distributions of the x and y coordinates,
we present a dynamic linear main memory solution, which answers 3-sided
queries in O(log n + t) worst case time and supports updates in
O(log log n) expected time with high probability, where n is the current
number of stored points and t is the size of the query output. We
externalize this solution, gaining O(log_B n + t/B) worst case and
O(log_B log n) amortized expected with high probability I/Os for query and
update operations respectively, where B is the disk block size.

(2) Then, we assume that the inserted points have their x-coordinates
drawn from a class of smooth distributions, whereas the y-coordinates are
arbitrarily distributed. The points to be deleted are selected uniformly
at random among the inserted points. In this case we present a dynamic
linear main memory solution that supports queries in O(log log n + t)
expected time with high probability and updates in O(log log n) expected
amortized time, where n is the number of points stored and t is the
size of the output of the query. We externalize this solution, gaining
O(log log_B n + t/B) expected I/Os with high probability for query
operations and O(log_B log n) expected amortized I/Os for update operations,
where B is the disk block size. The space remains linear O(n/B).

* This work is based on a combination of two conference papers that appeared in 
Proc. 21st International Symposium on Algorithms and Computation (ISAAC), 
2010: pages 1-12 (by all authors except third) and 13th International Conference 
on Database Theory (ICDT), 2010: pages 34-43 (by all authors except first). 
** Center for Massive Data Algorithmics, a Center of the Danish National Research 
Foundation. 



(3) Finally, we assume that the x-coordinates are continuously drawn
from a smooth distribution and the y-coordinates are continuously drawn
from a more restricted class of realistic distributions. In this case and by
combining the Modified Priority Search Tree [33] with the Priority Search
Tree [29], we present a dynamic linear main memory solution that supports
queries in O(log log n + t) expected time with high probability and
updates in O(log log n) expected time with high probability. We externalize
this solution, obtaining a dynamic data structure that answers
3-sided queries in O(log_B log n + t/B) I/Os expected with high probability,
and it can be updated in O(log_B log n) I/Os amortized expected
with high probability. The space remains linear O(n/B).

1 Introduction 

Recently, significant effort has been devoted to developing worst case
efficient data structures for range searching in two dimensions [36]. In their
pioneering work, Kanellakis et al. [19] illustrated that the problem of indexing
in new data models (such as constraint, temporal and object models) can be
reduced to special cases of two-dimensional indexing. In particular, they identified
the 3-sided range searching problem to be of major importance.

The 3-sided range query in the 2-dimensional space is defined by a region of
the form R = [a, b] × (−∞, c], i.e., an "open" rectangular region, and returns
all points contained in R. Figure 1 depicts examples of possible 3-sided queries,
defined by the shaded regions. Black dots represent the points comprising the
result. In many applications, only positive coordinates are used and therefore, the
region defining the 3-sided query always touches one of the two axes, according
to application semantics.

Fig. 1. Examples of 3-sided queries.

Consider a time evolving database storing measurements collected from a
sensor network. Assume further that each measurement is modeled as a
multi-attribute tuple of the form <id, a_1, a_2, ..., a_d, time>, where id is the
identifier of the sensor that produced the measurement, d is the total number of
attributes, each a_i, 1 ≤ i ≤ d, denotes the value of the specific attribute and
finally time records the time that this measurement was produced. These values
may relate to measurements regarding temperature, pressure, humidity, and so on.
Therefore, each tuple is considered as a point in R^d space. Let F: R^d → R be a
real-valued ranking function that scores each point based on the values of the
attributes. Usually, the scoring function F is monotone and without loss of
generality we assume that the lower the score the "better" the measurement (the
other case is symmetric). Popular scoring functions are the aggregates sum, min,
avg or other more complex combinations of the attributes. Consider the query:
"search for all measurements taken between the time instances t_1 and t_2 such
that the score is below s". Notice that this is essentially a 2-dimensional 3-sided
query with time as the x axis and score as the y axis. Such a transformation from a
multi-dimensional space to the 2-dimensional space is common in applications
that require a temporal dimension, where each tuple is marked with a timestamp
storing the arrival time [28]. This query may be expressed in SQL as follows:

SELECT id, score, time
FROM SENSOR_DATA
WHERE time >= t1 AND time <= t2 AND score <= s;
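The reduction from multi-attribute tuples to planar points can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple layout `(id, attributes, time)`, the helper names `to_point` and `three_sided_scan`, and the sample data are hypothetical, and the query is evaluated by a linear scan rather than by any of the index structures discussed in this paper.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical layout of one measurement: (id, attributes a_1..a_d, time).
Measurement = Tuple[int, Sequence[float], float]


def to_point(m: Measurement,
             score: Callable[[Sequence[float]], float]) -> Tuple[float, float]:
    """Map a measurement to the planar point (time, score)."""
    _id, attrs, time = m
    return (time, score(attrs))


def three_sided_scan(measurements: List[Measurement],
                     score: Callable[[Sequence[float]], float],
                     t1: float, t2: float, s: float):
    """Brute-force evaluation of the 3-sided query [t1, t2] x (-inf, s]."""
    result = []
    for m in measurements:
        x, y = to_point(m, score)
        if t1 <= x <= t2 and y <= s:
            result.append((m[0], y, x))  # (id, score, time), as in the SELECT list
    return result


# Made-up sample data; the scoring function is the aggregate `sum`.
data = [
    (1, (20.0, 1.0), 10.0),
    (2, (35.0, 2.0), 12.0),
    (3, (18.0, 0.5), 15.0),
    (4, (19.0, 0.5), 30.0),
]
hits = three_sided_scan(data, sum, t1=9.0, t2=20.0, s=25.0)
```

The scan costs time linear in the number of stored measurements per query, which is the cost that an index for 3-sided queries is meant to avoid.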



It is evident that, in order to support such queries, both search and update
operations (i.e., insertions/deletions) must be handled efficiently. Search effi- 
ciency directly impacts query response time as well as the general system perfor- 
mance, whereas update efficiency guarantees that incoming data are stored and 
organized quickly, thus, preventing delays due to excessive resource consump- 
tion. Notice that fast updates will enable the support of stream-based query 
processing [8] (e.g., continuous queries), where data may arrive at high rates 
and therefore the underlying data structures must be very efficient regarding 
insertions/deletions towards supporting arrivals/expirations of data. There is
a plethora of other applications (e.g., multimedia databases, spatio-temporal)
that fit a scenario similar to the previous one and can benefit from efficient
indexing schemes for 3-sided queries.

Another important issue in such data intensive applications is memory
consumption. Evidently, the best practice is to keep data in main memory if this is
possible. However, secondary memory solutions must also be available to cope
with large data volumes. For this reason, in this work we study both cases offering
efficient solutions both in the RAM and I/O computation models. In particular,
the rest of the paper is organized as follows. In Section 3 we discuss preliminary
concepts, define formally the classes of used probability distributions and
present the data structures that constitute the building blocks of our
constructions. Among them, we introduce the External Modified Priority Search Tree. In
Section 4 we present the two theorems that ensure the expected running times
of our constructions. The first solution is presented in Section 5, whereas our
second and third constructions are discussed in Sections 6 and 7 respectively.
Finally, Section 8 concludes the work and briefly discusses future research in the
area.
Table 1. Bounds for dynamic 3-sided planar range reporting. The number of
points in the structure is n, the size of the query output is t and the size of the
block is B.

randomized algorithm and expected time bound

x and y-coordinates are drawn from an unknown μ-random distribution, the μ
function never changes, deletions are uniformly random over the inserted points

expected with high probability

x-coordinates are smoothly distributed, y-coordinates are arbitrarily distributed,
deletions are uniformly random over the inserted points

amortized expected

we restrict the x-coordinate distribution to be (f(n), g(n))-smooth, for appropriate
functions f and g depending on the model, and the y-coordinate distribution to
belong to a more restricted class of distributions. The smooth distribution is a
superset of uniform and regular distributions. The restricted class contains realistic
distributions such as the Zipfian and the Power Law

amortized

amortized expected w.h.p.


The usefulness of 3-sided queries has been underlined many times in the
literature [111126]. Apart from the significance of this query in multi-dimensional data
intensive applications [TU [15], 3-sided queries appear in probabilistic threshold
queries in uncertain databases. Such queries are studied in a recent work of Cheng
et al. [11]. The problem has been studied both in main memory (RAM model)
and secondary storage (I/O model). In internal memory, the most commonly
used data structure for supporting 3-sided queries is the priority search tree of
McCreight [29]. It supports queries in O(log n + t) worst case time and insertions
and deletions of points in O(log n) worst case time, and uses linear space, where
n is the number of points and t the size of the output of a query. It is a hybrid
of a binary heap for the y-coordinates and of a balanced search tree for the
x-coordinates.

In the static case, when points have x-coordinates in the set of integers
{0, ..., n − 1}, the problem can be solved in O(n) space and preprocessing time
with O(t + 1) query time [5], using a range minimum query data structure [TS]
(see also Sec. 

In the RAM model, the only dynamic sublogarithmic bounds for this problem
are due to Willard [38] who attains O(log n / log log n) worst case or O(√(log n))
randomized update time and O(log n / log log n + t) query time using linear space.
This solution poses no assumptions on the input distribution.

Many external data structures such as grid files, various quad-trees, z-orders 
and other space filling curves, k-d-B-trees, hB-trees and various R-trees have 
been proposed. A recent survey can be found in [17]. Often these data structures
are used in applications, because they are relatively simple, require linear space 
and perform well in practice most of the time. However, they all have highly 
sub-optimal worst case (w.c.) performance, whereas their expected performance 
is usually not guaranteed by theoretical bounds, since they are based on heuristic 
rules for the construction and update operations. 

Moreover, several attempts have been made to externalize Priority Search
Trees, including [9], [19], [26], [32] and [34], but none of them is optimal.
The worst case optimal external memory solution (External Priority Search
Tree) was presented in [5]. It consumes O(n/B) disk blocks, performs 3-sided
range queries in O(log_B n + t/B) I/Os w.c. and supports updates in O(log_B n)
I/Os amortized. This solution poses no assumptions on the input distribution.

In this work, we present new data structures for the RAM and the I/O 
model that improve by a logarithmic factor the update time in an expected sense 
and attempt to improve the query complexity likewise. The bounds hold with 
high probability (w.h.p.) under assumptions on the distributions of the input 
coordinates. We propose three multi-level solutions, each with a main memory 
and an external memory variant. 

For the first solution, we assume that the x and y coordinates are being
continuously drawn from an unknown μ-random distribution. It consists of two
levels, for both internal and external variants. The upper level of the first solution
consists of a single Priority Search Tree [29] that indexes the structures of the
lower level. These structures are Priority Search Trees as well. For the external
variant we substitute the structures with their corresponding optimal external
memory solutions, the External Priority Search Trees [5]. The internal variant
achieves O(log n + t) w.c. query time and O(log log n) expected w.h.p. update
time, using linear space. The external solution attains O(log_B n + t/B) I/Os w.c.
for queries and O(log_B log n) I/Os amortized expected w.h.p. for updates, and
uses linear space.

For the second solution, we consider the case where the x-coordinates of
inserted points are drawn from a smooth probabilistic distribution, and the
y-coordinates are arbitrarily distributed. Moreover, the deleted points are selected
uniformly at random among the points in the data structure and queries can be
adversarial. The assumption on the x-coordinates is broad enough to include
distributions used in practice, such as uniform, regular and classes of non-uniform
ones [4, 23]. We present two linear space data structures, for the RAM and
the I/O model respectively. In the former model, we achieve a query time of
O(log log n + t) expected with high probability and update time of O(log log n)
expected amortized. In the latter model, the I/O complexity is O(log log_B n + t/B)
expected with high probability for the query and O(log_B log n) expected
amortized for the updates. In both cases, our data structures are deterministic and
the expectation is derived from a probabilistic distribution of the x-coordinates,
and an expected analysis of updates of points with respect to their y-coordinates.

By the third solution, we attempt to improve the expected query complexity
and simultaneously preserve the update and space complexity. In order to do
that, we restrict the x-coordinate distribution to be (f(n), g(n))-smooth, for
appropriate functions f and g depending on the model, and the y-coordinate
distribution to belong to a more restricted class of distributions. The smooth
distribution is a superset of uniform and regular distributions. The restricted
class contains realistic distributions such as the Zipfian and the Power Law. The
internal variant consists of two levels, of which the lower level is identical to
that of the first solution. We implement the upper level with a static Modified
Priority Search Tree [33]. For the external variant, in order to achieve the desired
bounds, we introduce three levels. The lower level is again identical to that of
the first solution, while the middle level consists of O(B) size buckets. For the
upper level we use an External Modified Priority Search Tree, introduced here
for the first time. The latter is a straightforward externalization of the Modified
Priority Search Tree and is static as well. In order to make these trees dynamic
we use the technique of global rebuilding [27]. The internal version reduces the
query complexity to O(log log n + t) expected with high probability and the
external to O(log_B log n + t/B) I/Os expected with high probability. The results
are summarized in Table 1.



3 Data Structures and Probability Distributions 

For the main memory solutions we consider the RAM model of computation. 
We denote by n the number of elements that reside in the data structures and 
by t the size of the query output. The universe of elements is denoted by S. When we
mention that a data structure performs an operation in an amortized expected 
with high probability complexity, we mean the bound is expected to be true 
with high probability, under a worst case sequence of insertions and deletions of 
points. 

For the external memory solutions we consider the I/O model of computation
[36]. That means that the input resides in external memory in a blocked
fashion. Whenever a computation needs to be performed on an element, the
block of size B that contains that element is transferred into main memory,
which can hold at most M elements. Every computation that is performed in
main memory is free, since the block transfer is orders of magnitude more time
consuming. Unneeded blocks that reside in main memory are evicted by
an LRU replacement algorithm. Naturally, the number of block transfers (I/O
operations) constitutes the cost metric of the I/O model.
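The cost measure of the I/O model can be made concrete with a small simulation. The sketch below is an illustration only: the function name `count_ios` and the convention that element i lives in block i // B are assumptions made here, not part of the model's definition. It counts one I/O per block that has to be brought into a memory of M // B block frames managed by LRU.

```python
from collections import OrderedDict


def count_ios(accesses, B, M):
    """Count block transfers for a sequence of element indices.

    B is the block size and M the main memory capacity, both in elements,
    so main memory holds M // B blocks. Eviction follows LRU.
    """
    capacity = M // B
    cache = OrderedDict()  # block id -> None, least recently used first
    ios = 0
    for idx in accesses:
        block = idx // B
        if block in cache:
            cache.move_to_end(block)       # hit: work in main memory is free
        else:
            ios += 1                       # miss: one block transfer
            cache[block] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used block
    return ios


sequential = count_ios(range(1024), B=64, M=256)
```

Scanning 1024 consecutive elements with B = 64 touches 16 blocks once each, which is the t/B behaviour that appears as the output term in the query bounds.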



Furthermore, we will consider that the points to be inserted are continuously 
drawn from specific distributions, presented in the sequel. The term continuously
implies that the distribution from which we draw the points remains unchanged. 
Since the solutions are dynamic, the asymptotic bounds are given with respect 
to the current size of the data structure. Finally, deletions of the elements of 
the data structures are assumed to be uniformly random. That is, every element 
present in the data structure is equally likely to be deleted pp] . 

3.1 Probability Distributions 

In this section, we overview the probabilistic distributions that will be used in 
the remainder of the paper. We will consider that the x and y-coordinates are 
distinct elements of these distributions and will choose the appropriate distribu- 
tion according to the assumptions of our constructions. 

A probability distribution is μ-random if the elements are drawn randomly
with respect to a density function denoted by μ. For this paper, we assume that
μ is unknown.

Informally, a distribution defined over an interval I is smooth if the probability
density over any subinterval of I does not exceed a specific bound, however
small this subinterval is (i.e., the distribution does not contain sharp peaks).
Given two functions f_1 and f_2, a density function μ = μ[a, b](x) is (f_1, f_2)-smooth
[30, 4] if there exists a constant β, such that for all c_1, c_2, c_3, a ≤ c_1 < c_2 < c_3 ≤ b,
and all integers n, it holds that:

\[ \int_{c_2 - \frac{c_3 - c_1}{f_1(n)}}^{c_2} \mu[c_1, c_3](x)\,dx \;\le\; \frac{\beta \cdot f_2(n)}{n} \]

where μ[c_1, c_3](x) = 0 for x < c_1 or x > c_3, and μ[c_1, c_3](x) = μ(x)/p for
c_1 ≤ x ≤ c_3 where p = ∫_{c_1}^{c_3} μ(x)dx. Intuitively, function f_1 partitions an arbitrary
subinterval [c_1, c_3] ⊆ [a, b] into f_1 equal parts, each of length (c_3 − c_1)/f_1 = O(1/f_1);
that is, f_1 measures how fine the partitioning of an arbitrary subinterval is. Function
f_2 guarantees that no part, of the f_1 possible, gets more probability mass than
β·f_2/n; that is, f_2 measures the sparseness of any subinterval
[c_2 − (c_3 − c_1)/f_1, c_2] ⊆ [c_1, c_3]. The class of (f_1, f_2)-smooth distributions (for
appropriate choices of f_1 and f_2) is a superset of both regular and uniform classes of
distributions, as well as of several non-uniform classes [4, 23]. Actually, any
probability distribution is (f_1, Θ(n))-smooth, for a suitable choice of β.
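As a sanity check of the definition, consider the special case where μ is the uniform density on [a, b]. The choice of f_1 and f_2 below is an illustrative assumption, picked only to make the arithmetic transparent. Conditioned on [c_1, c_3], the density is 1/(c_3 − c_1), and the integration range has length at most (c_3 − c_1)/f_1(n), so:

```latex
\[
\int_{c_2-\frac{c_3-c_1}{f_1(n)}}^{c_2} \mu[c_1,c_3](x)\,dx
\;\le\; \frac{1}{c_3-c_1}\cdot\frac{c_3-c_1}{f_1(n)}
\;=\; \frac{1}{f_1(n)} .
\]
% With f_1(n) = n^{\alpha} and f_2(n) = n^{1-\alpha}, 0 < \alpha < 1:
\[
\frac{1}{f_1(n)} \;=\; \frac{1}{n^{\alpha}} \;=\; \frac{n^{1-\alpha}}{n}
\;=\; \frac{\beta\, f_2(n)}{n} \quad\text{for } \beta = 1 .
\]
```

Hence the uniform distribution is (n^α, n^(1−α))-smooth with β = 1. The inequality in the first line is strict only when the integration range extends to the left of c_1, where the conditioned density is zero.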

The grid distribution assumes that the elements are integers that belong to
a specific range [1, M].

We define the restricted class of distributions as the class that contains
distributions used in practice, such as the Zipfian, Power Law, etc.

The Zipfian distribution is a distribution of probabilities of occurrence that
follows Zipf's law. Let N be the number of elements, k be their rank and s be the
value of the exponent characterizing the distribution. Then Zipf's law is defined
as the function

\[ f(k; s, N) = \frac{1/k^{s}}{\sum_{n=1}^{N} 1/n^{s}} . \]

Intuitively, few elements occur very often, while many elements occur rarely.



The Power Law distribution is a distribution over probabilities that satisfy
Pr[X > x] = c·x^(−b) for constants c, b > 0.
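The two distributions of the restricted class can be evaluated directly. The following sketch assumes the standard normalized form of Zipf's law, f(k; s, N) = (1/k^s) / Σ_{n=1..N} 1/n^s, and a power-law tail of the form c·x^(−b). The function names are chosen here for illustration.

```python
def zipf_pmf(k: int, s: float, N: int) -> float:
    """f(k; s, N) = (1/k^s) / sum_{n=1..N} 1/n^s for a rank k in 1..N."""
    if not 1 <= k <= N:
        raise ValueError("rank k must lie in 1..N")
    norm = sum(1.0 / n ** s for n in range(1, N + 1))
    return (1.0 / k ** s) / norm


def power_law_tail(x: float, c: float, b: float) -> float:
    """Tail probability c * x^(-b); a valid probability only when it is at most 1."""
    return c * x ** (-b)


# Probabilities of the five ranks for exponent s = 1.
probs = [zipf_pmf(k, s=1.0, N=5) for k in range(1, 6)]
```

The values in `probs` decrease with the rank and sum to one, which matches the intuition that few elements occur very often while many occur rarely.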

3.2 Data Structures 

In this section, we describe the data structures that we will combine in order to 
achieve the desired complexities. 

Priority Search Trees: The classic Priority Search Tree (PST) [29] stores
points in the 2-d space. One of the most important operations that the PST
supports is the 3-sided query. The 3-sided query consists of a half bounded
rectangle [a, b] × (−∞, c] and asks for all points that lie inside this area. Note
that by rotation we can unbound any edge of the rectangle. The PST supports
this operation in O(log n + t) w.c. time, where n is the number of points and t is the
number of the reported points.

The PST is a combination of a search tree and a priority queue. The search
tree (an (a, b)-tree suffices) allows the efficient support of searches, insertions
and deletions with respect to the x-coordinate, while the priority queue allows
for easy traversal of points with respect to their y-coordinate. In particular, the
leaves of the PST are the points sorted by x-coordinate. In the internal nodes
of the tree there are artificial values which are used for the efficient searching of
points with respect to their x-coordinate. In addition, each internal node stores a
point that has the minimum y-coordinate among all points stored in its subtree.
This corresponds to a tournament on the leaves of the PST. For example, the
root of the PST contains a point which has minimum y-coordinate among all
points in the plane, as well as a value which is in the interval defined between the
x-coordinates of the points stored in the rightmost leaf of the left subtree and
the leftmost leaf of the right subtree (this is true in the case of a binary tree).
A PST implemented with a red-black tree supports the operations of insertion
of a new point, deletion of an existing point and searching for the x-coordinate
of a point in O(log n) worst case time.
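The interplay between the search tree on x and the heap order on y can be illustrated with a minimal static sketch. Note the assumptions: this is the variant in which every node holds the minimum-y point of its subtree and the remaining points are split by x (not the leaf-oriented tournament layout described above), it supports no updates, and the names `Node`, `build` and `query` are introduced here for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (x, y)


@dataclass
class Node:
    point: Point              # minimum-y point among the points of this subtree
    split: float              # x-value separating the left and right subtrees
    left: Optional["Node"]
    right: Optional["Node"]


def build(points: List[Point]) -> Optional[Node]:
    """Build a static priority search tree from points sorted by x."""
    if not points:
        return None
    i = min(range(len(points)), key=lambda k: points[k][1])
    top = points[i]
    rest = points[:i] + points[i + 1:]   # still sorted by x
    mid = len(rest) // 2
    split = rest[mid][0] if rest else top[0]
    return Node(top, split, build(rest[:mid]), build(rest[mid:]))


def query(node: Optional[Node], a: float, b: float, c: float,
          out: List[Point]) -> None:
    """Append to `out` all points with a <= x <= b and y <= c."""
    if node is None or node.point[1] > c:
        return                            # heap order: nothing below qualifies
    if a <= node.point[0] <= b:
        out.append(node.point)
    if a <= node.split:                   # left subtree holds x <= split
        query(node.left, a, b, c, out)
    if b >= node.split:                   # right subtree holds x >= split
        query(node.right, a, b, c, out)


pts = sorted([(1, 5), (2, 1), (3, 7), (4, 3), (5, 9), (6, 2), (7, 8)])
root = build(pts)
found: List[Point] = []
query(root, 2, 6, 3, found)
```

The y-test at the top of `query` is what prunes whole subtrees: once a node's point lies above c, every point below it does as well.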

Regarding the I/O model, after several attempts, a worst case optimal
solution was presented by Arge et al. in [5]. The proposed indexing scheme consumes
O(n/B) space, supports updates in O(log_B n) amortized I/Os and answers
3-sided range queries in O(log_B n + t/B) I/Os. We will refer to this indexing scheme
as the External Priority Search Tree (EPST).

Interpolation Search Trees: In [24], a dynamic data structure based on
interpolation search (IS-Tree) was presented, which consumes linear space and
can be updated in O(1) time w.c. Furthermore, the elements can be searched in
O(log log n) time expected w.h.p., given that they are drawn from an
(n^α, n^β)-smooth distribution, for any arbitrary constants 0 < α, β < 1. The
externalization of this data structure, called interpolation search B-tree (ISB-tree), was
introduced in [21]. It supports update operations in O(1) worst-case I/Os
provided that the update position is given and search operations in O(log_B log n)
I/Os expected w.h.p. The expected search bound holds w.h.p. if the elements
are drawn from an (n/(log log n)^(1+ε), n^(1−δ))-smooth distribution, where ε > 0 and
δ = 1 − 1/B are constants. The worst case search bound is O(log_B n) block
transfers.
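The idea that both structures build on is plain interpolation search. The sketch below shows that basic technique on a sorted array of integers. It is not the IS-Tree or the ISB-tree themselves, and its good expected behaviour relies on the keys being spread evenly. On adversarial inputs it can take linear time.

```python
def interpolation_search(keys, target):
    """Return an index i with keys[i] == target, or -1 if target is absent.

    `keys` must be a sorted list of integers.
    """
    lo, hi = 0, len(keys) - 1
    while lo <= hi and keys[lo] <= target <= keys[hi]:
        if keys[hi] == keys[lo]:
            # All remaining keys are equal, and target lies between them.
            return lo
        # Estimate the position by linear interpolation between the endpoints.
        pos = lo + (target - keys[lo]) * (hi - lo) // (keys[hi] - keys[lo])
        if keys[pos] == target:
            return pos
        if keys[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1


keys = list(range(0, 1000, 7))
position = interpolation_search(keys, 14)
```

On evenly spaced keys such as these the first probe already lands on the target, which is the behaviour that the smoothness assumptions generalize.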

Weight Balanced Exponential Tree: The exponential search tree is a technique
for converting static polynomial space search structures for ordered sets
into fully-dynamic linear space data structures. It was introduced in [1] I35[ |6]
for searching and updating a dynamic set U of n integer keys in linear space and
optimal O(√(log n / log log n)) time in the RAM model. Effectively, to solve the
dictionary problem, a doubly logarithmic height search tree is employed that stores
static local search structures of size polynomial to the degree of the nodes.

Here we describe a variant of the exponential search tree that we dynamize
using a rebalancing scheme similar to that of the weight balanced search trees [7].
In particular, a weight balanced exponential tree T on n points is a leaf-oriented
rooted search tree where the degrees of the nodes increase double exponentially
on a leaf-to-root path. All leaves have the same depth and reside on the lowest
level of the tree (level zero). The weight of a subtree T_u rooted at node u is defined
to be the number of its leaves. If u lies at level i ≥ 1, the weight of T_u ranges
within [½·w_i + 1, 2·w_i − 1], for a weight parameter w_i = c_1^(c_2^i) and constants
c_2 > 1 and c_1 ≥ 2^(3/(c_2−1)) (see Lem. 1). Note that w_(i+1) = w_i^(c_2). The root does not
need to satisfy the lower bound of this range. The tree has height O(log_(c_2) log_(c_1) n).
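To get a feeling for how quickly the weight parameters grow, the following sketch tabulates them, assuming the doubly exponential form w_i = c_1^(c_2^i). The default constants c_1 = 8 and c_2 = 2 and the function name are illustrative choices made here, not values prescribed by the construction.

```python
def weight_parameters(n: int, c1: float = 8.0, c2: float = 2.0):
    """Return [w_0, w_1, ...] with w_i = c1 ** (c2 ** i), up to the first w_i >= n."""
    if c1 <= 1 or c2 <= 1:
        raise ValueError("the sketch assumes c1 > 1 and c2 > 1")
    weights = []
    i = 0
    while True:
        w = c1 ** (c2 ** i)
        weights.append(w)
        if w >= n:
            return weights
        i += 1


levels = weight_parameters(10 ** 6)
```

For a million points only four weight parameters are needed (8, 64, 4096 and 16777216), since each one is the previous one raised to the power c_2. This is the source of the doubly logarithmic height.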

The insertion of a new leaf to the tree increases the weight of the nodes
on the leaf-to-root path by one. This might cause some weights to exceed their
range constraints ("overflow"). We rebalance the tree in order to revalidate the
constraints by a leaf-to-root traversal, where we "split" each node that
overflowed. An overflown node u at level i has weight 2·w_i. A split is performed by
creating a new node v that is a sibling of u and redistributing the children of u
among u and v such that each node acquires a weight within the allowed range.
In particular, we scan the children of u, accumulating their weights until we
exceed the value w_i, say at child x. Node u gets the scanned children and v gets
the rest. Node x is assigned as a child to the node with the smallest weight.
Processing the overflown nodes u bottom up guarantees that, during the split
of u, its children satisfy their weight constraints.
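The redistribution step of a split can be sketched on the list of child weights alone. This is a simplified illustration with assumed names: `split_children` works on integers rather than tree nodes, and ties between the two sides are broken in favour of the left node, a detail the description above leaves open.

```python
def split_children(weights, w_i):
    """Split the child weights of an overflown node between u and a new sibling v.

    Children are scanned left to right while their accumulated weight does not
    exceed w_i. The child x at which the sum would exceed w_i is given to the
    lighter of the two sides. Returns (children of u, children of v).
    """
    acc = 0
    for j, w in enumerate(weights):
        if acc + w > w_i:      # weights[j] is the child x of the description
            break
        acc += w
    else:
        raise ValueError("total weight does not exceed w_i; nothing to split")
    left, right = list(weights[:j]), list(weights[j + 1:])
    if sum(left) <= sum(right):
        left.append(weights[j])      # x stays with u as its rightmost child
    else:
        right.insert(0, weights[j])  # x moves to v as its leftmost child
    return left, right


u_children, v_children = split_children([5, 4, 6, 2, 7], w_i=12)
```

The left-to-right order of the children is preserved, so the two resulting nodes still cover contiguous key ranges.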

The deletion of a leaf might cause the nodes on the leaf-to-root path to
"underflow", i.e. a node u at level i reaches weight ½·w_i. By an upwards traversal
of the path, we discover the underflown nodes. In order to revalidate their node
constraints, each underflown node u chooses a sibling node v to "merge" with.
That is, we assign the children of u to v and delete u. Possibly, v needs to
"split" again if its weight after the merge is more than 3/2·w_i ("share"). In either
case, the traversal continues upwards, which guarantees that the children of the
underflown nodes satisfy their weight constraints. The following lemma, which
is similar to [3, Lem. 9], holds.

Lemma 1. After rebalancing a node u at level i, Ω(w_i) insertions or deletions
need to be performed on T_u, for u to overflow or underflow again.



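Proof. A split, a merge or a share on a node u on level i yield nodes with
weight in [¾·w_i − w_(i−1), 3/2·w_i + w_(i−1)]. If we set w_(i−1) ≤ ⅛·w_i, which always holds
for c_1 ≥ 2^(3/(c_2−1)), this interval is always contained in [5/8·w_i, 14/8·w_i]. Hence
at least ⅛·w_i = Ω(w_i) updates are needed before the weight of T_u reaches ½·w_i
or 2·w_i again. □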
Fig. 2. The linear space MPST.



Range Minimum Queries: The range minimum query (RMQ) problem asks
to preprocess an array of size n such that, given an index range, one can report
the position of the minimum element in the range. In [TB] the RMQ problem
is solved in O(1) time using O(n) space and preprocessing time. The currently
most space efficient solution that supports queries in O(1) time appears in [13].
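The way an RMQ structure yields static 3-sided reporting for integer x-coordinates can be sketched as follows. As a simplifying assumption, the sketch uses a sparse table, which needs O(n log n) space instead of the O(n) of the cited solutions. The reporting recursion is the same regardless of the RMQ structure used. All function names are illustrative.

```python
def build_sparse_table(ys):
    """table[k][i] is the index of the minimum of ys[i : i + 2**k]."""
    n = len(ys)
    table = [list(range(n))]
    k = 1
    while (1 << k) <= n:
        prev, half = table[-1], 1 << (k - 1)
        row = []
        for i in range(n - (1 << k) + 1):
            l, r = prev[i], prev[i + half]
            row.append(l if ys[l] <= ys[r] else r)
        table.append(row)
        k += 1
    return table


def rmq(ys, table, lo, hi):
    """Index of the minimum of ys[lo..hi] (inclusive), using two table lookups."""
    k = (hi - lo + 1).bit_length() - 1
    l, r = table[k][lo], table[k][hi - (1 << k) + 1]
    return l if ys[l] <= ys[r] else r


def report(ys, table, a, b, c):
    """Report all x in [a, b] with ys[x] <= c using at most 2t + 1 RMQs."""
    out, stack = [], [(a, b)]
    while stack:
        lo, hi = stack.pop()
        if lo > hi:
            continue
        m = rmq(ys, table, lo, hi)
        if ys[m] <= c:               # the minimum qualifies: report and recurse
            out.append(m)
            stack.append((lo, m - 1))
            stack.append((m + 1, hi))
    return out


ys = [5, 2, 8, 1, 9, 3, 7, 4]        # ys[x] is the y-coordinate of the point at x
table = build_sparse_table(ys)
answer = report(ys, table, 1, 6, 3)
```

Every RMQ either reports a new point or closes a subrange, so the number of RMQs is at most 2t + 1 and the reporting time is O(t + 1).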



Dynamic External Memory 3-sided Range Queries for O(B^2) Points.

For external memory, Arge et al. [5] present the following lemma for handling a
set of at most B^2 points.

Lemma 2. A set of K ≤ B^2 points can be stored in O(K/B) blocks, so that
3-sided queries need O(t/B + 1) I/Os and updates O(1) I/Os, for output size t.

Proof. See Lemma 1 presented in [5].

Modified Priority Search Trees: A Modified Priority Search Tree (MPST)
is a static data structure that stores points on the plane and supports 3-sided
queries. It is stored as an array (Arr) in memory, yet it can be visualized as a
complete binary tree. Although it has been presented in [35], [33] we sketch it
here again, in order to introduce its external version.

Let T be a Modified Priority Search Tree (MPST) [33] which stores n points
of S (see Figure 2). We denote by T_v the subtree of T with root v. Let u be a
leaf of the tree. Let P_u be the root-to-leaf path for u. For every u, we sort the
points in P_u by their y-coordinate. We denote by P_u^j the subpath of P_u with
nodes of depth bigger or equal to j (the depth of the root is 0). Similarly, L_u^j
(respectively R_u^j) denotes the set of nodes that are left (resp. right) children of
nodes of P_u^j and do not belong to P_u^j. The tree structure T has the following
properties:

— Each point of S is stored in a leaf of T and the points are in sorted x-order 
from left to right. 

— Each internal node v is equipped with a secondary list S(v). S(v) contains
the points stored in the leaves of T_v in increasing y-coordinate.

— A leaf u also stores the following lists A(u), P^j(u), L^j(u) and R^j(u), for
0 ≤ j ≤ log n. The lists P^j(u), L^j(u) and R^j(u) store, in increasing
y-coordinate, pointers to the respective internal nodes. A(u) is an array that
indexes j.

Note that the first element of the list S(v) is the point of the subtree with
minimum y-coordinate. Also note that 0 ≤ j ≤ log n, so there are log n such
sets P_u^j, L_u^j, R_u^j for each leaf u. Thus the size of A is log n and for a given j,
any list P^j(u), L^j(u) or R^j(u) can be accessed in constant time. By storing the
nodes of the tree T according to their inorder traversal in an array Arr of size
O(n), the structure of tree T is kept implicitly. Also each element of Arr contains
a binary label that corresponds to the inorder position of the respective node of
T, in order to facilitate constant time lowest common ancestor (LCA) queries.

To answer a query with the range [a, b] × (−∞, c] we find the two leaves u,
w of Arr that contain a and b respectively. If we assume that the leaves that
contain a, b are given, we can access them in constant time. Then, since Arr
contains an appropriate binary label, we use a simple LCA (Lowest Common
Ancestor) algorithm [TB] [18] to compute the depth j of the nearest common
ancestor of u, w in O(1) time. That is done by performing the XOR operation
between the binary labels of the leaves u and w and finding the position of the
first set bit, provided that the left-most bit is placed in position 0. Afterwards,
we traverse P^j(u) as long as the scanned y-coordinate is not bigger than c. Next,
we traverse R^j(u), L^j(w) in order to find the nodes whose stored points have
y-coordinate not bigger than c. For each such node v we traverse the list S(v)
in order to report the points of Arr that satisfy the query. Since we only access
points that lie in the query, the total query time is O(t), where t is the answer
size.
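The XOR step of the query can be illustrated on a complete binary tree whose leaves are numbered from left to right. This labelling is an assumption made for the sketch. The structure described above labels all nodes by their inorder position, but the principle is the same: the first bit in which the two labels differ identifies the level at which the two root-to-leaf paths separate.

```python
def lca_depth(u: int, w: int, height: int) -> int:
    """Depth of the lowest common ancestor of leaves u and w.

    Leaves of a complete binary tree of the given height are labelled
    0 .. 2**height - 1 from left to right, written with `height` bits whose
    left-most (most significant) bit is at position 0. The root has depth 0.
    """
    diff = u ^ w
    if diff == 0:
        return height                  # same leaf: the LCA is the leaf itself
    # Position of the first set bit of u XOR w, counted from the left.
    return height - diff.bit_length()


depth = lca_depth(2, 3, height=3)      # labels 010 and 011
```

Leaves 010 and 011 first differ in bit position 2, so their lowest common ancestor lies at depth 2, directly above them.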

The total size of the lists S(v) for each level of T is O(n). Each of the O(n)
leaves stores log n lists P^j, L^j and R^j, each of which consumes O(log n) space.
Thus the space for these lists becomes O(n log^2 n). By implementing these lists
as partially persistent sorted lists [TU], their total space becomes O(n log n).
Thus, the total space occupied by T is O(n log n).

We can reduce the space of the structure by pruning as in [Ml [31]. However,
pruning alone does not reduce the space to linear. We can get better but not
optimal results by applying pruning recursively. To get an optimal space bound
we will use a combination of pruning and table lookup. The pruning method is
as follows: Consider the nodes of T, which have height log log n. These nodes
are roots of subtrees of T of size O(log n) and there are O(n/log n) such nodes.
Let T_1 be the tree whose leaves are these nodes and let T_2^i be the subtrees of
these nodes for 1 ≤ i ≤ O(n/log n). We call T_1 the first layer of the structure
and the subtrees T_2^i the second layer. T_1 and each subtree T_2^i is by itself a
Modified Priority Search Tree. Note that T_1 has size O(n/log n), which is O(n). Each
subtree T_2^i has O(log n / log log n) leaves and depth O(log log n). The space for
the second layer is O(n log n). By applying the pruning method to all the trees
of the second layer we get a third layer which consists of O(n / log log n) modified
priority search trees each of size O(log log n). Ignoring the third layer, the second
layer needs now linear space, while the O(n log n) space bottleneck is charged
on the third layer. If we use table lookup [15] to implement the modified priority
search trees of the third layer we can reduce its space to linear, thus consuming
linear space in total.

In order to answer a query on the three layered structure we access the
microtrees that contain a and b and extract in O(1) time the part of the answer
that is contained in them. Then we locate the subtrees T_2^i, T_2^j that contain
the representative leaves of the accessed microtrees and extract the part of the
answer that is contained in them by executing the query algorithm of the MPST.
The roots of these subtrees are leaves of T_1. Thus we execute again the MPST
query algorithm on T_1 with these leaves as arguments. Once we reach the node
with y-coordinate bigger than c, we continue in the same manner top down.
This may lead us to subtrees of the second layer that contain part of the answer
and have not been accessed yet. That means that for each accessed tree of the
second layer, we execute the MPST query algorithm, where instead of a and b,
we set as arguments the minimum and the maximum x-coordinates of all the
points stored in the queried tree. The argument c remains, of course, unchanged.
Correspondingly, in that way we access the microtrees of the third layer that
contain part of the answer. We execute the top down part of the algorithm on
them, in order to report the final part of the answer.

Lemma 3. Given a set of n points on the plane we can store them in a static 
data structure with O(n) space that allows three-sided range queries to be 
answered in O(t) worst case time, where t is the answer size. 

Proof. See [33]. 

The External Modified Priority Search Tree (EMPST) is similar to the MPST, 
yet we store the lists in a blocked fashion. In order to attain linear space in 
external memory we prune the structure k times, instead of two times. The pruning 
terminates when $\log^{(k)} n = O(B)$. Since computation within a block is free, we 
do not need the additional layer of microtrees. In this way we achieve O(n/B) 
space. 

Assume that the query algorithm accesses first the two leaves u and v of the 
k-th layer of the EMPST, which contain a and b respectively. If they belong to 
different EMPSTs of that layer, we recursively take the roots of these EMPSTs 
until the roots $r_u$ and $r_v$ belong to the same EMPST, w.l.o.g. the one on the 
upper layer. That is done in O(k) = O(1) I/Os. Then, in O(1) I/Os we access the 
j-th entry of $A(r_u)$ and $A(r_v)$, where j is the depth of $LCA(r_u, r_v)$, thus also the 
corresponding sublists $P^j(r_u)$, $R^j(r_u)$, $L^j(r_u)$ and $P^j(r_v)$, $R^j(r_v)$, $L^j(r_v)$. Since 
these sublists are y-ordered, by scanning them in $t_1/B$ I/Os we get all the $t_1$ 
pointers to the S-lists that contain part of the answer. We access the S-lists 
in $t_1$ I/Os and scan them as well in order to extract the part of the answer 
(say $t_2$) they contain. We then recursively access the $t_2$ S-lists of the layer 
below and extract the part $t_3$ that resides on them. In total, we consume 
$t_1/B + t_1 \cdot t_2/B + \dots + t_{i-1} \cdot t_i/B + \dots + t_{k-1} \cdot t_k/B$ I/Os. Let $p_i$ be the probability that 
$t_i = t^{p_i}$, where t is the total size of the answer and $\sum_{i=1}^{k} p_i = 1$. Thus, we 
need $t^{p_1}/B + \sum_{i=1}^{k-1} t^{p_i} \cdot t^{p_{i+1}}/B$ I/Os, or $t^{p_1}/B + \sum_{i=1}^{k-1} t^{p_i + p_{i+1}}/B$ I/Os. Assuming 
w.h.p. an equally likely distribution of the answer amongst the k layers, we need 
$t^{1/k}/B + \sum_{i=1}^{k-1} t^{1/k} \cdot t^{1/k}/B$ expected I/Os, or $t^{1/k}/B + (k-1) \cdot t^{2/k}/B$. Since $k \gg 2$, 
we need in total O(t/B) expected w.h.p. I/Os. 
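
The I/O accounting above can be checked numerically. The following is only an illustrative sketch: the function names `layered_query_ios` and `uniform_split` are hypothetical, and the per-layer answer sizes are passed in directly rather than derived from an actual EMPST.

```python
from typing import List


def layered_query_ios(ts: List[float], B: int) -> float:
    """I/O count t_1/B + sum_{i>=2} t_{i-1} * t_i / B for per-layer answer sizes ts.

    ts[0] is t_1 (pointers found in the top layer), ts[1] is t_2, and so on.
    """
    if not ts:
        return 0.0
    total = ts[0] / B
    # Each layer below costs (number of lists accessed) * (answer found there) / B.
    for prev, cur in zip(ts, ts[1:]):
        total += prev * cur / B
    return total


def uniform_split(t: float, k: int) -> List[float]:
    """Assumed 'equally likely' case: every layer contributes t^(1/k)."""
    return [t ** (1.0 / k)] * k


if __name__ == "__main__":
    t, k, B = 4096, 4, 8
    print(layered_query_ios(uniform_split(t, k), B), "vs bound", k * t / B)
```

For constant k and t >= 1 each term is at most t/B, so the total stays within k * t/B = O(t/B), in line with the bound stated above.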

Lemma 4. Given a set of n points on the plane we can store them in a static 
data structure with O(n/B) space that allows three-sided range queries to be 
answered in O(t/B) expected I/Os w.h.p., where t is the size of the answer. 

4 Expected First Order Statistic of Unknown Distributions 

In this section, we prove two theorems that will ensure the expected running 
times of our constructions. Our constructions are multilevel data structures, where 
for each pair of levels, the upper level indexes representative elements (in our 
case, points on the plane) of the lower level buckets. We call an element violating 
when its insertion to or deletion from the lower level bucket causes the 
representative of that bucket to change, thus triggering an update on the upper 
level. We prove that for an epoch of O(log n) updates, the number of violating 
elements is O(1) if they are continuously being drawn from a μ-random distribution. 
Secondly, we prove that for a broader epoch of O(n) updates, the number of violating 
elements is O(log n), given that the elements are being continuously drawn from 
a distribution that belongs to the restricted class. Violations are with respect 
to the y-coordinates, while the distribution of elements in the buckets is with 
respect to x-coordinates. 

But first, the proof of an auxiliary lemma is necessary. Assume a sequence S 
of distinct numbers generated by a continuous distribution μ = F over a universe 
U. Let |S| denote the size of S. Then, the following holds: 

Lemma 5. The probability that the next element q drawn from F is less than 
the minimum element s in S is equal to $\frac{1}{|S|+1}$. 

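Proof. Suppose that we have n random observations $X_1, \dots, X_n$ from an 
unknown continuous probability density function f(X), with cumulative 
distribution μ = F(X), X ∈ [a, b]. We want to compute the probability that the 
(n+1)-th observation is less than $\min\{X_1, \dots, X_n\}$. Let $X_{(1)} = \min\{X_1, \dots, X_n\}$. 
Therefore, 

$$P\{X_{n+1} < X_{(1)}\} = \sum_x P\{X_{n+1} < X_{(1)} \mid X_{(1)} = x\} \cdot P\{X_{(1)} = x\} \quad (\alpha).$$

It is easy to see that $P\{X_{n+1} < X_{(1)} \mid X_{(1)} = x\} = F(x) = P\{X_{n+1} < x\}$ 
(β). Also $P\{X_{(k)} = x\} = n \cdot f(x) \cdot \binom{n-1}{k-1} \cdot F(x)^{k-1} \cdot (1 - F(x))^{n-k}$ (γ), where 
$X_{(k)}$ is the k-th smallest value in $\{X_1, \dots, X_n\}$. 

In our case k = 1, which intuitively means that we have n choices for one in 
$\{X_1, \dots, X_n\}$ being the smallest value. This is true if all the remaining n - 1 are more 
than x, which occurs with probability $(1 - F(x))^{n-1} = (1 - P\{X < x\})^{n-1}$. 
By (β) and (γ), expression (α) becomes: 

$$P\{X_{n+1} < X_{(1)}\} = \int_a^b n \cdot f(X) \cdot \binom{n-1}{0} \cdot F(X) \cdot (1 - F(X))^{n-1} \, dX.$$

After some mathematical manipulations (integration by parts, using F(a) = 0 and 
F(b) = 1), we have that: 

$$\begin{aligned}
P\{X_{n+1} < X_{(1)}\} &= \int_a^b n \cdot f(X) \cdot (1 - F(X))^{n-1} \cdot F(X) \, dX
 = \int_a^b \left[-(1 - F(X))^n\right]' F(X) \, dX \\
&= \int_a^b \left[-(1 - F(X))^n \cdot F(X)\right]' dX + \int_a^b (1 - F(X))^n F'(X) \, dX \\
&= \left[-(1 - F(X))^n \cdot F(X)\right]_a^b + \int_a^b \left[-\frac{(1 - F(X))^{n+1}}{n+1}\right]' dX \\
&= -(1 - F(b))^n \cdot F(b) + (1 - F(a))^n \cdot F(a) - \left[\frac{(1 - F(X))^{n+1}}{n+1}\right]_a^b \\
&= -\left\{\frac{(1 - F(b))^{n+1}}{n+1} - \frac{(1 - F(a))^{n+1}}{n+1}\right\} = \frac{1}{n+1}.
\end{aligned}$$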

Apparently, the same holds if we want to maintain the maximum element of 
the set S. 
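
Lemma 5 depends only on the relative order of the observations, so it can be verified exactly for small n by enumeration. The snippet below is a minimal sanity check, not part of the original proof; the function name `prob_new_minimum` is hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def prob_new_minimum(n: int) -> Fraction:
    """Exact probability that the (n+1)-th of n+1 distinct values is the minimum.

    For i.i.d. draws from a continuous distribution all (n+1)! relative
    orders are equally likely, so it suffices to enumerate rank permutations.
    """
    orders = list(permutations(range(n + 1)))
    # The last position holds the newly drawn element; rank 0 is the minimum.
    hits = sum(1 for order in orders if order[-1] == 0)
    return Fraction(hits, len(orders))


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, prob_new_minimum(n))
```

The enumeration agrees with the closed form 1/(n+1) for every n tried, independently of the distribution F.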



Proposition 1 Suppose that the input elements have their x-coordinate 
generated by an arbitrary continuous distribution μ on $[a, b] \subseteq \mathbb{R}$. Let n be the 
number of elements stored in the data structure at the latest reconstruction. An 
epoch starts with log n updates. During the i-th update let $N(i) \in [n, r \cdot n]$, with 
constant r > 1, denote the number of elements currently stored in the buckets that 
partition $[a, b] \subseteq \mathbb{R}$. Then the N(i) elements remain μ-randomly distributed in 
the buckets per i-th update. 

Proof. The proof is analogous to Lem. 2] and is omitted. 

Theorem 2. For a sequence of O(log n) updates, the expected number of 
violating elements is O(1), assuming that the elements are being continuously drawn 
from a μ-random distribution. 

Proof. According to Prop. 1, there are $N(i) \in [n, r \cdot n]$ (with constant r > 1) 
elements with their x-coordinates μ-randomly distributed in the buckets 
$j = 1, \dots, \frac{n}{\log n}$ that partition $[a, b] \subseteq \mathbb{R}$. By [23, Th. 4], with high probability, each 
bucket j receives an x-coordinate with probability $p_j = \Theta(\frac{\log n}{n})$. It follows that 
during the i-th update operation, the number of elements in bucket j is a Binomial 
random variable with mean $p_j \cdot N(i) = \Theta(\log n)$. 

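The elements with x-coordinates in an arbitrary bucket j are $\alpha N(i)$ with 
probability 

$$\binom{N(i)}{\alpha N(i)} p_j^{\alpha N(i)} (1 - p_j)^{(1-\alpha) N(i)} \sim \left[\left(\frac{p_j}{\alpha}\right)^{\alpha} \left(\frac{1 - p_j}{1 - \alpha}\right)^{1-\alpha}\right]^{N(i)}.$$

In turn, these are $\le \alpha N(i) = \frac{p_j}{2} N(i)$ (less than half of the bucket's mean) with 
probability 

$$\le \frac{p_j N(i)}{2} \cdot \left[\left(\frac{p_j}{\alpha}\right)^{\alpha} \left(\frac{1 - p_j}{1 - \alpha}\right)^{1-\alpha}\right]^{N(i)} \to 0 \qquad (1)$$

as $n \to \infty$ and $\alpha = \frac{p_j}{2}$. 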

Suppose that an element is inserted in the i-th update. It induces a violation 
if its y-coordinate is strictly the minimum element of the bucket j it falls into. 

- If the bucket contains $\ge \frac{1}{2} \log N(i) \ge \frac{1}{2} \log n$ coordinates then by Lemma 5 
element y incurs a violation with probability $O(\frac{1}{\log n})$. 

- If the bucket contains $< \frac{1}{2} \log N(i)$ coordinates, which is as likely as in Eq. 
(1), then element y may induce $\le 1$ violation. 

Putting these cases together, element y expectedly induces at most $O(\frac{1}{\log n})$ + Eq. 
(1) $= O(\frac{1}{\log n})$ violations. We conclude that during the whole epoch of log n 
insertions the expected number of violations is at most $\log n \cdot O(\frac{1}{\log n})$ plus $\log n \cdot$ 
Eq. (1), which is O(1). 

Theorem 3. For a sequence of O(n) updates, the expected number of violating 
elements is O(log n), assuming that x-coordinates are drawn from a continuous 
smooth distribution and the y-coordinates are drawn from the restricted class 
of distributions (power-law or Zipfian). 

Proof. Suppose an element is inserted, with its y-coordinate following a discrete 
distribution (while its x-coordinate is arbitrarily distributed) in the universe 
$\{y_1, y_2, \dots\}$ with $y_i < y_{i+1}, \forall i \ge 1$. Also, let $q = \Pr[y > y_1]$ and $y^*$ the min 
y-coordinate of the elements in bucket j as soon as the current epoch starts. 
Clearly, the element just inserted incurs a violation when landing into bucket j 
with probability $\Pr[y < y^*]$. 

- If the bucket contains $\ge \frac{1}{2} \log N(i) \ge \frac{1}{2} \log n$ coordinates, then coordinate 
y incurs a violation with probability $\le q^{\Omega(\log n)}$. (In other words, a violation 
may happen when at most all the $\Omega(\log n)$ coordinates of the elements in 
bucket j are $> y_1$, that is, when $y^* > y_1$.) 

- If the bucket contains $< \frac{1}{2} \log N(i)$ coordinates, which is as likely as in Eq. 
(1), then coordinate y may induce $\le 1$ violation. 

All in all, the y-coordinate expectedly induces $\le q^{\Omega(\log n)}$ + Eq. (1) violations. Thus, 
during the whole epoch of n insertions the expected number of violations is at 
most $n \cdot q^{\Omega(\log n)} + n \cdot$ Eq. (1) $= n q^{\Omega(\log n)} + o(1)$. This is at most 
$c \cdot \log n = O(\log n)$ if $q \le \left(\frac{c \log n}{n}\right)^{(\log n)^{-1}}$. 



Remark 4 Note that Power Law and Zipfian distributions have the 
aforementioned property that $q \le \left(\frac{c \log n}{n}\right)^{(\log n)^{-1}} \to e^{-1}$ as $n \to \infty$. 



5 The First Solution for Random Distributions 



In this section, we present the construction that works under the assumptions 
that the x and y-coordinates are continuously drawn by an unknown μ-random 
distribution. 

The structure we propose consists of two levels, as well as an auxiliary data 
structure. All of them are implemented as PSTs. The lower level partitions the 
points into buckets of almost equal logarithmic size according to the x-coordinate 
of the points. That is, the points are sorted in increasing order according to 
x-coordinate and then divided into sets of O(log n) elements, each of which 
constitutes a bucket. A bucket C is implemented as a PST and is represented by 
a point $C_{min}$ which has the smallest y-coordinate among all points in it. This 
means that for each bucket the cost for insertion, deletion and search is equal to 
O(log log n), since this is the height of the PST representing C. 

The upper level is a PST on the representatives of the lower level. Thus, 
the number of leaves in the upper level is $O(\frac{n}{\log n})$. As a result, the upper level 
supports the operations of insert, delete and search in O(log n) time. In addition, 
we keep an extra PST for insertions of violating points. In this context, we 
call a point p violating when its y-coordinate is less than $C_{min}$ of the bucket C 
in which it should be inserted. In the case of a violating point we must change 
the representative of C and as a result we should make an update operation on 
the PST of the upper level, which costs too much, namely O(log n). 

We assume that the x and y-coordinates are drawn from an unknown 
μ-random distribution and that the μ function never changes. Under this 
assumption, according to the combinatorial game of bins and balls, presented in Section 
5 of [53], the size of every bucket is $O(\log^c n)$, where c > 0 is a constant, and no 
bucket becomes empty w.h.p. We consider epochs of size O(log n), with respect 
to update operations. During an epoch, according to Theorem 2, the number 
of violating points is expected to be O(1) w.h.p. The extra PST stores exactly 
those O(1) violating points. When a new epoch starts, we take all points from 
the extra PST and insert them in the respective buckets in time O(log log n) 
expected w.h.p. Then we need to incrementally update the PST of the upper 
level. This is done during the new epoch that just started. In this way, we keep 
the PST of the upper level updated and the size of the extra PST constant. 
As a result, the update operations are carried out in O(log log n) time expected 
w.h.p., since the update of the upper level costs O(1) time w.c. 
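
The epoch mechanism can be illustrated with a deliberately simplified model. The sketch below is an assumption-laden toy, not the structure of the paper: buckets are plain lists instead of PSTs, the upper level is reduced to an array of representatives, the flush happens all at once instead of incrementally, and the class name `EpochBuckets` and its fields are hypothetical.

```python
import bisect
from typing import List, Tuple

Point = Tuple[float, float]


class EpochBuckets:
    """Toy model: x-ordered buckets, min-y representatives, deferred violations."""

    def __init__(self, points: List[Point], bucket_size: int, epoch_len: int):
        pts = sorted(points)  # assumes a non-empty initial point set
        self.buckets = [pts[i:i + bucket_size] for i in range(0, len(pts), bucket_size)]
        self.lows = [b[0][0] for b in self.buckets]             # smallest x per bucket
        self.reps = [min(b, key=lambda p: p[1]) for b in self.buckets]
        self.extra: List[Point] = []   # violating points (the "extra PST")
        self.epoch_len = epoch_len
        self.updates = 0
        self.upper_level_updates = 0   # how often a representative changed

    def _bucket_of(self, x: float) -> int:
        return max(0, bisect.bisect_right(self.lows, x) - 1)

    def insert(self, p: Point) -> None:
        j = self._bucket_of(p[0])
        if p[1] < self.reps[j][1]:
            self.extra.append(p)       # violation: postpone the upper-level update
        else:
            self.buckets[j].append(p)  # ordinary insertion into the bucket
        self.updates += 1
        if self.updates % self.epoch_len == 0:
            self._start_new_epoch()

    def _start_new_epoch(self) -> None:
        # Move the violating points into their buckets and fix representatives.
        for p in self.extra:
            j = self._bucket_of(p[0])
            self.buckets[j].append(p)
            if p[1] < self.reps[j][1]:
                self.reps[j] = p
                self.upper_level_updates += 1
        self.extra = []
```

In this toy model the counter `upper_level_updates` plays the role of the expensive O(log n) operations; Theorem 2 is what bounds their expected number per epoch.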

The 3-sided query can be carried out in the standard way. Assume the query 
[a, b] x (-∞, c]. First we search down the PST of the upper level for a and b. 
Let $P_a$ be the search path for a and $P_b$ for b respectively. Let $P_m = P_a \cap P_b$. 
Then, we check whether the points in the nodes on $P_a \cup P_b$ belong to the answer 
by checking their x-coordinate as well as their y-coordinate. Then, we check all 
right children of $P_a - P_m$ as well as all left children of $P_b - P_m$. In this case we 
just check their y-coordinate since we know that their x-coordinate belongs in 
[a, b]. When a point belongs in the query, we also check its two children and we 
do this recursively. After finishing with the upper level we go to the respective 
buckets by following a single pointer from the nodes of the upper level PST of 
which the points belong in the answer. Then we traverse in the same way the 
buckets and find the set of points to report. Finally, we check the extra PST for 
reported points. In total the query time is O(log n + t) w.c. 
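
The standard PST query can be sketched compactly. The following is a minimal illustration on a single static, median-split priority search tree; it uses a top-down recursion with x-pruning instead of explicit search paths $P_a$ and $P_b$, and the names `Node`, `build_pst` and `three_sided_query` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class Node:
    point: Point               # point with minimum y in this subtree
    split: float               # left subtree has x <= split, right has x >= split
    left: Optional["Node"]
    right: Optional["Node"]


def build_pst(points: List[Point]) -> Optional[Node]:
    """Build a static priority search tree from points sorted by x."""
    if not points:
        return None
    i = min(range(len(points)), key=lambda j: points[j][1])
    rest = points[:i] + points[i + 1:]       # still sorted by x
    mid = len(rest) // 2
    split = rest[mid][0] if rest else points[i][0]
    return Node(points[i], split, build_pst(rest[:mid]), build_pst(rest[mid:]))


def _query(node: Optional[Node], a: float, b: float, c: float, out: List[Point]) -> None:
    # Min-heap order on y: if this point is above c, so is the whole subtree.
    if node is None or node.point[1] > c:
        return
    if a <= node.point[0] <= b:
        out.append(node.point)
    if a <= node.split:                      # left subtree may intersect [a, b]
        _query(node.left, a, b, c, out)
    if node.split <= b:                      # right subtree may intersect [a, b]
        _query(node.right, a, b, c, out)


def three_sided_query(root: Optional[Node], a: float, b: float, c: float) -> List[Point]:
    """Report all points with a <= x <= b and y <= c."""
    out: List[Point] = []
    _query(root, a, b, c, out)
    return out


if __name__ == "__main__":
    pts = [(1, 5), (2, 1), (3, 7), (4, 3), (5, 2), (6, 9), (7, 4)]
    root = build_pst(sorted(pts))
    print(sorted(three_sided_query(root, 2, 6, 3)))
```

In the two-level structure described above, the same routine would be applied first to the upper-level PST and then to each bucket whose representative was reported.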

Note that deletions of points do not affect the correctness of the query algo- 
rithm. If a non violating point is deleted, it should reside on the lower level and 
thus it would be deleted online. Otherwise, the auxiliary PST contains it and 
thus the deletion is online again. No deleted violating point is incorporated into 
the upper level, since by the end of the epoch the PST contains only inserted 
violating points. 

Theorem 5. There exists a dynamic main memory data structure that 
supports 3-sided queries in O(log n + t) w.c. time, can be updated in O(log log n) 
expected time w.h.p. and consumes linear space, under the assumption that the x and 
y-coordinates are continuously drawn from a μ-random distribution. 

If we implement the above solution by using EPSTs [5], instead of PSTs, 
then the solution becomes I/O-efficient; however, the update cost is amortized 
instead of worst case. Thus we get that: 

Theorem 6. There exists a dynamic external memory data structure that 
supports 3-sided queries in $O(\log_B n + t/B)$ w.c. I/Os, can be updated in $O(\log_B \log n)$ 
amortized expected I/Os w.h.p. and consumes linear space, under the assumption that 
the x and y-coordinates are continuously drawn from a μ-random distribution. 

6 The Second Solution for the Smooth and Random 
Distributions 

We will present the invented data structures in RAM and I/O model respectively. 



6.1 The Second Solution in RAM model 

Our internal memory construction for storing n points in the plane consists of 
an IS-tree storing the points in sorted order with respect to the x-coordinates. 
On the sorted points, we maintain a weight balanced exponential search tree T 
with c_2 = 3/2 and c_1 = 2^. Thus its height is Θ(log log n). In order to use T 
as a priority search tree, we augment it as follows. The root stores the point 
with overall minimum y-coordinate. Points are assigned to nodes in a top-down 
manner, such that a node u stores the point with minimum y-coordinate among 
the points in $T_u$ that is not already stored at an ancestor of u. Note that the 
point from a leaf of T can only be stored at an ancestor of the leaf and that 
the y-coordinates of the points stored at a leaf-to-root path are monotonically 
decreasing (Min-Heap Property). Finally, every node contains an RMQ-structure 
on the y-coordinates of the points in the children nodes and an array with 
pointers to the children nodes. Every point in a leaf can occur at most once in 
an internal node u and the RMQ-structure of u's parent. Since the space of the 
IS-tree is linear [30, 24], so is the total space. 

Querying the Data Structure: Before we describe the query algorithm of 
the data structure, we will describe the query algorithm that finds all points 
with y-coordinate less than c in a subtree $T_u$. Let the query begin at an internal 
node u. At first we check if the y-coordinate of the point stored at u is smaller 
than or equal to c (we call it a member of the query). If not we stop. Else, we identify 
the $t_u$ children of u storing points with y-coordinate less than or equal to c, 
using the RMQ-structure of u. That is, we first query the whole array and then 
recurse on the two parts of the array partitioned by the index of the returned 
point. The recursion ends when the point found has y-coordinate larger than c 
(non-member point). 

Lemma 6. For an internal node u and value c, all points stored in $T_u$ with 
y-coordinate $\le c$ can be found in O(t + 1) time, when t points are reported. 

Proof. Querying the RMQ-structure at a node v that contains $t_v$ member points 
will return at most $t_v + 1$ non-member points. We only query the RMQ-structure 
of a node v if we have already reported its point as a member point. Summing 
over all visited nodes we get a total cost of $O(\sum_v (2 t_v + 1)) = O(t + 1)$. □ 
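
The recursive use of the RMQ-structure at a single node can be sketched as follows. This is only an illustration: `argmin` is a naive linear-time stand-in for a constant-time RMQ-structure, and both function names are hypothetical.

```python
from typing import List


def argmin(ys: List[float], lo: int, hi: int) -> int:
    """Index of the minimum of ys[lo..hi] (inclusive).

    Naive stand-in: a real RMQ-structure would answer this in O(1) time.
    """
    return min(range(lo, hi + 1), key=lambda i: ys[i])


def report_members(ys: List[float], lo: int, hi: int, c: float) -> List[int]:
    """Indices i in [lo, hi] with ys[i] <= c, found by repeated range-minimum queries."""
    out: List[int] = []
    stack = [(lo, hi)]
    while stack:
        l, r = stack.pop()
        if l > r:
            continue
        m = argmin(ys, l, r)
        if ys[m] > c:
            # Non-member minimum: nothing else in [l, r] can qualify.
            continue
        out.append(m)
        # Split the range around the reported index and continue on both parts.
        stack.append((l, m - 1))
        stack.append((m + 1, r))
    return out


if __name__ == "__main__":
    children_y = [5, 1, 7, 3, 2, 9, 4]
    print(sorted(report_members(children_y, 0, 6, 3)))
```

Every reported index triggers at most two further range-minimum queries, which is the counting argument behind the O(t + 1) bound of Lemma 6.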

In order to query the whole structure, we first process a 3-sided query [a, b] x 
(-∞, c] by searching for a and b in the IS-tree. The two accessed leaves a, b 
of the IS-tree comprise leaves of T as well. We traverse T from a and b to the 
root. Let $P_a$ (resp. $P_b$) be the root-to-leaf path for a (resp. b) in T and let 
$P_m = P_a \cap P_b$. During the traversal we also record the index of the traversed 
child. When we traverse a node u on the path $P_a - P_m$ (resp. $P_b - P_m$), the 
recorded index comprises the leftmost (resp. rightmost) margin of a query to 
the RMQ-structure of u. Thus all accessed children by the RMQ-query will 
be completely contained in the query's x-range [a, b]. Moreover, by Lem. 6 the 
RMQ-structure returns all member points in $T_u$. 

For the lowest node in $P_m$, i.e. the lowest common ancestor (LCA) of a and 
b, we query the RMQ-structure for all subtrees contained completely within a 
and b. We don't execute RMQ-queries on the rest of the nodes of $P_m$, since they 
root subtrees that overlap the query's x-range. Instead, we merely check if the 
x- and y-coordinates of their stored point lie within the query. Since the paths 
$P_m$, $P_a - P_m$ and $P_b - P_m$ have length O(log log n), the query time of T becomes 
O(log log n + t). When the x-coordinates are smoothly distributed, the query to 
the IS-tree takes O(log log n) expected time with high probability [30]. Hence 
the total query time is O(log log n + t) expected with high probability. 

Inserting and Deleting Points: Before we describe the update algorithm of 
the data structure, we will first prove some properties of updating the points in 
T. Suppose that we decrease the y-value of a point $p_u$ at node u to the value 
y'. Let v be the ancestor node of u highest in the tree with y-coordinate bigger 
than y'. We remove $p_u$ from u. This creates an "empty slot" that has to be filled 
by the point of its child with smallest y-coordinate. The same procedure has to 
be applied to the affected child, thus causing a "bubble down" of the empty slot 
until a node is reached with no points at its children. Next we replace v's point 
$p_v$ with $p_u$ (swap). We find the child of v that contains the leaf corresponding 
to $p_v$ and swap its point with $p_v$. The procedure recurses on this child until an 
empty slot is found to place the last swapped out point ("swap down"). In case 
of increasing the y-value of a node the update to T is the same, except that $p_u$ 
is now inserted at a node along the path from u to the leaf corresponding to $p_u$. 
For every swap we will have to rebuild the RMQ-structures of the parents 
of the involved nodes, since the RMQ-structures are static data structures. This 
has a cost linear in the size of the RMQ-structure (Sect. 3). 

Lemma 7. Let i be the highest level where the point has been affected by an 
update. Rebuilding the RMQ-structures due to the update takes $O(w_i^{c_2 - 1})$ time. 

Proof. The executed "bubble down" and "swap down", along with the search 
for v, traverse at most two paths in T. We have to rebuild all the RMQ-structures 
that lie on the two v-to-leaf paths, as well as that of the parent 
of the top-most node of the two paths. The RMQ-structure of a node at level j 
is proportional to its degree, namely $O(w_j / w_{j-1})$. Thus, the total time becomes 
$O(\sum_{j=1}^{i+1} w_j / w_{j-1}) = O(\sum_{j=1}^{i+1} w_{j-1}^{c_2 - 1}) = O(w_i^{c_2 - 1})$. □ 



To insert a point p, we first insert it in the IS-tree. This creates a new leaf 
in T, which might cause several of its ancestors to overflow. We split them 
as described in Sec. 3. For every split a new node is created that contains no 
point. This empty slot is filled by "bubbling down" as described above. Next, we 
search on the path to the root for the node where p should reside according to the 
Min-Heap Property and execute a "swap down", as described above. Finally, all 
affected RMQ-structures are rebuilt. 

To delete point p, we first locate it in the IS-tree, which points out the 
corresponding leaf in T. By traversing the leaf-to-root path in T, we find the 
node in T that stores p. We delete the point from the node and "bubble down" the 
empty slot, as described above. Finally, we delete the leaf from T and rebalance T 
if required. Merging two nodes requires one point to be "swapped down" through 
the tree. In case of a share, we additionally "bubble down" the new empty slot. 
Finally we rebuild all affected RMQ-structures and update the IS-tree. 

Analysis: We assume that the point to be deleted is selected uniformly at 
random among the points stored in the data structure. Moreover, we assume 
that the inserted points have their x-coordinates drawn independently at random 
from an $(n^{\alpha}, n^{1/2})$-smooth distribution for a constant $1/2 < \alpha < 1$, and that the 
y-coordinates are drawn from an arbitrary distribution. Searching and updating 
the IS-tree needs O(log log n) expected time with high probability [30, 24], under the 
same assumption for the x-coordinates. 



Lemma 8. Starting with an empty weight balanced exponential tree, the 
amortized time of rebalancing it due to insertions or deletions is O(1). 

Proof. A sequence of n updates requires at most $O(n / w_i)$ rebalancings at level i 
(Lem. 2). Rebuilding the RMQ-structures after each rebalancing costs $O(w_i^{c_2 - 1})$ 
time (Lem. 7). Summing over all levels, the total time becomes 
$O(\sum_{i=1}^{ht(T)} \frac{n}{w_i} \cdot w_i^{c_2 - 1}) = O(n \sum_{i=1}^{ht(T)} w_i^{c_2 - 2}) = O(n)$, when $c_2 < 2$. □ 

Lemma 9. The expected amortized time for inserting or deleting a point in a 
weight balanced exponential tree is O(1). 

Proof. The insertion of a point creates a new leaf and thus T may rebalance, 
which by Lemma 8 costs O(1) amortized time. Note that the shape of T only 
depends on the sequence of updates and the x-coordinates of the points that 
have been inserted. The shape of T is independent of the y-coordinates, but the 
assignment of points to the nodes of T follows uniquely from the y-coordinates, 
assuming all y-coordinates are distinct. Let u be the ancestor at level i of the leaf 
for the new point p. For any integer $k \ge 1$, the probability of p being inserted at u 
or an ancestor of u can be bounded by the probability that a point from a leaf 
of $T_u$ is stored at the root down to the k-th ancestor of u plus the probability that 
the y-coordinate of p is among the k smallest y-coordinates of the leaves of $T_u$. The 
first probability is bounded by $\sum_{j=i+k}^{ht(T)} \frac{w_{j-1}}{w_j}$, whereas the second probability 
is bounded by $k / w_i$. It follows that p ends up at the i-th ancestor or higher with 
probability at most $O(\sum_{j=i+k}^{ht(T)} \frac{w_{j-1}}{w_j} + \frac{k}{w_i}) = O(\sum_{j=i+k}^{ht(T)} w_{j-1}^{1 - c_2} + \frac{k}{w_i}) = 
O(w_{i+k-1}^{1 - c_2} + \frac{k}{w_i}) = O(\frac{1}{w_i})$ for $c_2 = 3/2$ and k = 3. Thus 
the expected cost of "swapping down" p becomes $O(\sum_{i=1}^{ht(T)} \frac{1}{w_i} \cdot \frac{w_{i+1}}{w_i}) = 
O(\sum_{i=1}^{ht(T)} w_i^{c_2 - 2}) = O(\sum_{i=1}^{ht(T)} c_1^{(c_2 - 2) c_2^i}) = O(1)$ for $c_2 < 2$. 

A deletion results in "bubbling down" an empty slot, whose cost depends on 
the level of the node that contains it. Since the point to be deleted is selected 
uniformly at random and there are $O(n / w_i)$ points at level i, the probability 
that the deleted point is at level i is $O(1 / w_i)$. Since the cost of an update 
at level i is $O(w_{i+1} / w_i)$, we get that the expected "bubble down" cost is 
$O(\sum_{i=1}^{ht(T)} \frac{1}{w_i} \cdot \frac{w_{i+1}}{w_i}) = O(1)$ for $c_2 < 2$. □ 

Theorem 7. In the RAM model, using O(n) space, 3-sided queries can be 
supported in O(log log n + t) expected time with high probability, and updates in 
O(log log n) time expected amortized, given that the x-coordinates of the inserted 
points are drawn from an $(n^{\alpha}, n^{1/2})$-smooth distribution for constant $1/2 < \alpha < 1$, 
the y-coordinates from an arbitrary distribution, and that the deleted points are 
drawn uniformly at random among the stored points. 

6.2 The Second Solution in I/O model 

We now convert our internal memory solution into a solution for the I/O model. First 
we substitute the IS-tree with its variant in the I/O model, the ISB-Tree [21]. 
We implement every consecutive $O(B^2)$ leaves of the ISB-Tree with the data 
structure of Arge et al. [5]. Each such structure constitutes a leaf of a weight 
balanced exponential tree T that we build on top of the $O(n / B^2)$ leaves. 

In T every node now stores B points sorted by y-coordinate, such that 
the maximum y-coordinate of the points in a node is smaller than all the 
y-coordinates of the points of its children (Min-Heap Property). The B points 
with overall smallest y-coordinates are stored at the root. At a node u we store 
the B points from the leaves of $T_u$ with smallest y-coordinates that are not 
stored at an ancestor of u. At the leaves we consider the B points with smallest 
y-coordinate among the remaining points in the leaf to comprise this list. 
Moreover, we define the weight parameter of a node at level i to be $w_i = B^{2 \cdot (7/6)^i}$. Thus 
we get $w_{i+1} = w_i^{7/6}$, which yields a height of $\Theta(\log \log_B n)$. Let $d_i = \frac{w_i}{w_{i-1}} = w_i^{1/7}$ 
denote the degree parameter for level i. All nodes at level i have degree $O(d_i)$. Also 
every node stores an array that indexes the children according to their x-order. 
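
The doubly logarithmic height follows directly from the growth of the weight parameter and can be seen numerically. The snippet below is only an illustrative sketch that assumes the weight rule w_i = B^(2 * (7/6)^i) as read from the text; the function name `weights` and the sample values of n and B are hypothetical.

```python
import math
from typing import List


def weights(n: int, B: int) -> List[float]:
    """Weight parameters w_0, w_1, ... with w_i = B^(2 * (7/6)^i), up to the first w_i >= n.

    Assumes B >= 2 so that the sequence is strictly increasing.
    """
    ws: List[float] = []
    i = 0
    while True:
        w = B ** (2 * (7 / 6) ** i)
        ws.append(w)
        if w >= n:
            return ws
        i += 1


if __name__ == "__main__":
    n, B = 10**9, 16
    ws = weights(n, B)
    print("levels needed:", len(ws) - 1)
    print("log log_B n  :", math.log2(math.log(n, B)))
    # Consecutive weights satisfy w_i / w_{i-1} = w_i^(1/7), the degree parameter.
    print(ws[3] / ws[2], ws[3] ** (1 / 7))
```

Because the exponent of B grows geometrically, the number of levels grows with log log_B n, while the degree of a node is a fixed power of its weight.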

We furthermore need a structure to identify the children with respect to their 
y-coordinates. We replace the RMQ-structure of the internal memory solution 
with a table. For every possible interval [k, l] over the children of the node, we 
store in an entry of the table the points of the children that belong to this 
interval, sorted by y-coordinate. Since every node at level i has degree $O(d_i)$, 
there are $O(d_i^2)$ different intervals and for each interval we store $O(B \cdot d_i)$ points. 
Thus, the total size of this table is $O(B \cdot d_i^3)$ points or $O(d_i^3)$ disk blocks. 
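
The interval table can be pictured with a small in-memory model. This is a hypothetical sketch that ignores blocking and I/O: each child is represented by the list of points it stores, and the names `build_interval_table` and `scan_interval` are not from the paper.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Table = Dict[Tuple[int, int], List[Point]]


def build_interval_table(children: List[List[Point]]) -> Table:
    """For every child interval [k, l], store the children's points sorted by y."""
    table: Table = {}
    for k in range(len(children)):
        merged: List[Point] = []
        for l in range(k, len(children)):
            # Extend the interval by one child and keep the list y-sorted.
            merged = sorted(merged + children[l], key=lambda p: p[1])
            table[(k, l)] = merged
    return table


def scan_interval(table: Table, k: int, l: int, c: float) -> List[Point]:
    """Scan the y-sorted entry for [k, l] and stop at the first point above c."""
    out: List[Point] = []
    for p in table.get((k, l), []):
        if p[1] > c:
            break
        out.append(p)
    return out


if __name__ == "__main__":
    children = [[(1, 4), (2, 9)], [(3, 1), (4, 6)], [(5, 2), (6, 8)]]
    table = build_interval_table(children)
    print(len(table), scan_interval(table, 1, 2, 5))
```

With d children there are d(d+1)/2 entries of up to B * d points each, matching the $O(B \cdot d_i^3)$ size stated above; the scan touches only reported points plus one terminating point.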

The ISB-Tree consumes O(n/B) blocks [21]. Each of the $O(n / B^2)$ leaves of T 
contains $B^2$ points. Each of the $n / w_i$ nodes at level i contains B points and a 
table with $O(B \cdot d_i^3)$ points. Thus, the total space is 
$O(n + \sum_{i=1}^{ht(T)} n \cdot B \cdot d_i^3 / w_i) = O(n + \sum_{i=1}^{ht(T)} n \cdot B / (B^{2 \cdot (7/6)^i})^{4/7}) = O(n)$ points, 
i.e. O(n/B) disk blocks. 

Querying the Data Structure: The query is similar to the internal memory 
construction. First we access the ISB-Tree, spending $O(\log_B \log n)$ expected I/Os 
with high probability, given that the x-coordinates are smoothly distributed [21]. 
This points out the leaves of T that contain a, b. We perform a 3-sided range 
query at the two leaf structures. Next, we traverse upwards the leaf-to-root path 
$P_a$ (resp. $P_b$) on T, while recording the index k (resp. l) of the traversed child in 
the table. That costs $\Theta(\log \log_B n)$ I/Os. At each node we report the points of 
the node that belong to the query range. For all nodes on $P_a - P_b$ and $P_b - P_a$ we 
query as follows: We access the table at the appropriate children range, recorded 
by the index k and l. These ranges are always [k + 1, last child] and [0, l - 1] for 
the nodes that lie on $P_a - P_b$ and $P_b - P_a$, respectively. The only node where 
we access a range [k + 1, l - 1] is the LCA of the leaves that contain a and b. 
The recorded indices facilitate access to these entries in O(1) I/Os. We scan the 
list of points sorted by y-coordinate, until we reach a point with y-coordinate 
bigger than c. All scanned points are reported. If the scan has reported all B 
elements of a child node, the query proceeds recursively to that child, since more 
member points may lie in its subtree. Note that for these recursive calls, we do 
not need to access the B points of a node u, since we accessed them in the table of u's 
parent. The table entries they access contain the complete range of children. If 
the recursion accesses a leaf, we execute a 3-sided query on it, with respect to a 
and b [5]. 

The list of B points in every node can be accessed in O(1) I/Os. The 
construction of [5] allows us to load the B points with minimum y-coordinate in a 
leaf also in O(1) I/Os. Thus, traversing $P_a$ and $P_b$ costs $\Theta(\log \log_B n)$ I/Os worst 
case. There are $O(\log \log_B n)$ nodes u on $P_a - P_m$ and $P_b - P_m$. The algorithm 
recurses on nodes that lie within the x-range. Since the table entries that we scan 
are sorted by y-coordinate, we access only points that belong to the answer. 
Thus, we can charge the scanning I/Os to the output. The algorithm recurses on 
all children nodes whose B points have been reported. The I/Os to access these 
children can be charged to their points reported by their parents, thus to the 
output. That allows us to access the child even if it contains only o(B) member 
points to be reported. The same property holds also for the access to the leaves. 
Thus we can perform a query on a leaf in O(t/B) I/Os. Summing up, the worst 
case query complexity of querying T is $O(\log \log_B n + t/B)$ I/Os. Hence in total 
the query costs $O(\log \log_B n + t/B)$ expected I/Os with high probability. 

Inserting and Deleting Points: Insertions and deletions of points are in 
accordance with the internal solution. For the case of insertions, first we update 
the ISB-tree. This creates a new leaf in the ISB-tree that we also insert at the 
appropriate leaf of T in O(1) I/Os [5]. This might cause some ancestors of the 
leaves to overflow. We split these nodes, as in the internal memory solution. 
For every split B empty slots "bubble down". Next, we update T with the new 
point. For the inserted point p we locate the highest ancestor node that contains 
a point with y-coordinate larger than p's. We insert p in the list of the node. 
This causes an excess point, namely the one with maximum y-coordinate among 
the B points stored in the node, to "swap down" towards the leaves. Next, we 
scan all affected tables to replace a single point with a new one. 



In case of deletions, we search the ISB-tree for the deleted point, which points 
out the appropriate leaf of T. By traversing the leaf-to-root path and loading the 
list of B points, we find the point to be deleted. We remove the point from the 
list, which creates an empty slot that "bubbles down" T towards the leaves. Next 
we rebalance T as in the internal solution. For every merge we need to "swap 
down" the B largest excess points. For a share, we need to "bubble down" B 
empty slots. Next, we rebuild all affected tables and update the ISB-tree. 

Analysis: Searching and updating the ISB-tree requires $O(\log_B \log n)$ expected 
I/Os with high probability, given that the x-coordinates are drawn from an 
$(n / (\log \log n)^{1+\varepsilon}, n^{1/B})$-smooth distribution, for constant $\varepsilon > 0$ [21]. 

Lemma 10. For every path corresponding to a "swap down" or a "bubble down" 
starting at level i, the cost of rebuilding the tables of the paths is $O(d_{i+1}^3)$ I/Os. 

Proof. Analogously to Lem. 7, a "swap down" or a "bubble down" traverses at 
most two paths in T. A table at level j costs $O(d_j^3)$ I/Os to be rebuilt, thus all 
tables on the paths need $O(\sum_{j=1}^{i+1} d_j^3) = O(d_{i+1}^3)$ I/Os. □ 



Lemma 11. Starting with an empty external weight balanced exponential tree, 
the amortized I/Os for rebalancing it due to insertions or deletions is O(1). 

Proof. We follow the proof of Lem. 8. Rebalancing a node at level i requires 
$O(d_{i+1}^3 + B \cdot d_i^3)$ I/Os (Lem. 10), since we get B "swap downs" and "bubble 
downs" emanating from the node. The total I/O cost for a sequence of n updates 
is $O(\sum_{i=1}^{ht(T)} \frac{n}{w_i} \cdot (d_{i+1}^3 + B \cdot d_i^3)) = O(n)$. □ 



Lemma 12. The expected amortized number of I/Os for inserting or deleting a 
point in an external weight balanced exponential tree is O(1). 

Proof. By similar arguments as in Lem. 9 and considering that a node contains B 
points, we bound the probability that point p ends up at the i-th ancestor or 
higher by O(B/w_i). An update at level i costs O(d_{i+1}^3) = O(w_i^{1/2}) I/Os. Thus 
"swapping down" p costs O(\sum_i (B/w_i) · w_i^{1/2}) = O(1) expected I/Os. The same 
bound holds for deleting p, following similar arguments as in Lem. 9. □ 

Theorem 8. In the I/O model, using O(n/B) disk blocks, 3-sided queries can be 
supported in O(log log_B n + t/B) expected I/Os with high probability, and updates 
in O(log_B log n) expected amortized I/Os, given that the x-coordinates of the 
inserted points are drawn from an (n/(log log n)^{1+ε}, n^{1/B})-smooth distribution 
for a constant ε > 0, the y-coordinates from an arbitrary distribution, and that 
the deleted points are drawn uniformly at random among the stored points. 

7 The Third Solution for the Smooth and the Restricted 
Distributions 

We would like to improve the query time while preserving the update 
time. For this purpose we incorporate into the structure the MPST, which is 
a static data structure. We dynamize it by using the technique of global 
rebuilding [37], which unfortunately costs O(n) time. 

In order to retain the update time in the same sublogarithmic levels, we 
must ensure that at most a logarithmic number of lower level structures will be 
violated in a broader epoch of O(n) updates. Because the violations concern the 
y-coordinates, we restrict their distribution to the more restricted class, for 
which Theorem 3 ensures exactly this property. Thus, the auxiliary PST consumes 
at most O(log n) space during an epoch. 

Moreover, we must waive the previous assumption on the x-coordinate 
distribution as well. Since the query time of the previous solution was O(log n), 
we could afford to pay as much time in order to locate the leaves containing a and 
b. In this case, though, this blows up our complexity. If, however, we assume 
that the x-coordinates are drawn from an (n^α, n^β)-smooth distribution, we can 
use an IS-tree to index them, given that 0 < α, β < 1. By doing that, we pay 
O(log log n) time w.h.p. to locate a and b. 
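
To give some intuition for why smoothly distributed x-coordinates allow the
endpoints a and b to be located quickly, the sketch below shows plain
interpolation search on a sorted array. This is a stand-in chosen for
illustration and is not the IS-tree itself, which is a dynamic tree structure;
the function name predecessor_by_interpolation is hypothetical. It returns the
index of the largest key that is not bigger than the search key.

```python
def predecessor_by_interpolation(xs, key):
    """Index of the largest element <= key in the sorted list xs, or -1."""
    if not xs or key < xs[0]:
        return -1
    lo, hi = 0, len(xs) - 1
    if key >= xs[hi]:
        return hi
    # Invariant from here on: xs[lo] <= key < xs[hi], hence xs[lo] < xs[hi].
    while hi - lo > 1:
        # Guess the position by linear interpolation between the endpoints.
        frac = (key - xs[lo]) / (xs[hi] - xs[lo])
        mid = lo + int(frac * (hi - lo))
        # Clamp strictly inside (lo, hi) so that every step makes progress.
        mid = min(max(mid, lo + 1), hi - 1)
        if xs[mid] <= key:
            lo = mid
        else:
            hi = mid
    return lo


xs = [1, 3, 4, 8, 15, 16, 23, 42]
pos_a = predecessor_by_interpolation(xs, 8)
pos_b = predecessor_by_interpolation(xs, 20)
```

On keys that are spread evenly the interpolated guess lands close to the
target, which is the effect the smoothness assumption is meant to guarantee in
expectation; on adversarial keys this simple array version can degrade to a
linear number of steps.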




Fig. 3. The internal memory construction for the restricted distributions 

When a new epoch starts, we take all points from the extra PST and insert 
them in the respective buckets in O(log log n) time w.h.p. During the epoch 
we gather all the violating points that should access the MPST and the points 
that belong to it, and build in parallel a new MPST on them. At the end of 
the epoch of O(n) updates, we have built the updated version of the MPST, which 
we use for the next epoch that just started. In this way, we keep the MPST of the 
upper level updated and the size of the extra PST logarithmic. By incrementally 
constructing the new MPST we spend O(1) worst-case time for each update of 
the epoch. As a result, the update operation is carried out in O(log log n) time 
expected with high probability. 
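
The epoch mechanism can be illustrated with a minimal Python sketch. It is a
toy model under several assumptions that are not in the paper: a Python set
stands in for the static MPST (main), another set stands in for the auxiliary
PST (aux), only insertions are handled, every inserted point is treated as a
violating point, and an epoch lasts as many updates as main had points when
the epoch began. The class name EpochRebuilder and its fields are hypothetical.
Each update copies one old point into the structure under construction, so the
rebuilding work is spread evenly over the epoch.

```python
class EpochRebuilder:
    _DONE = object()  # sentinel marking that all old points have been copied

    def __init__(self, points=()):
        self.main = set(points)   # treated as read-only during an epoch
        self._begin_epoch()

    def _begin_epoch(self):
        self.aux = set()                  # points inserted during this epoch
        self.new = set()                  # next version of main, built step by step
        self.todo = iter(self.main)       # old points still to be copied
        self.updates_left = max(1, len(self.main))

    def insert(self, p):
        self.aux.add(p)
        self.new.add(p)
        # One unit of rebuilding work per update.
        q = next(self.todo, self._DONE)
        if q is not self._DONE:
            self.new.add(q)
        self.updates_left -= 1
        if self.updates_left == 0:
            # The new structure is complete: swap it in and start a new epoch.
            self.main = self.new
            self._begin_epoch()

    def __contains__(self, p):
        return p in self.main or p in self.aux


r = EpochRebuilder([1, 2, 3])
r.insert(10)
r.insert(11)
mid_epoch = (10 in r, len(r.aux))
r.insert(12)   # third update of an epoch of length 3: the swap happens here
```

Queries consult both main and aux, which mirrors the way the query algorithm
has to check the auxiliary structure in addition to the static one.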

For the 3-sided query [a, b] x (-∞, c], we first access the leaves of the lower 
level that contain a and b, through the IS-tree. This costs O(log log n) time w.h.p. 
Then the query proceeds bottom up in the standard way. First it traverses the 
buckets that contain a and b and then it accesses the MPST from the leaves of the 
buckets' representatives. Once the query reaches the node of the MPST with 
y-coordinate bigger than c, it continues top down to the respective buckets, which 
contain part of the answer, by following a single pointer from the nodes of the 
upper level MPST. Then we traverse these buckets top down and complete the 
set of points to report. Finally, we check the auxiliary PST for points to report. 
The traversal of the MPST is charged to the size of the answer O(t) and the 
traversal of the lower level costs O(log log n) expected with high probability. Due 
to Theorem 3, the size of the auxiliary PST is O(log n) with high probability, 
thus the query spends O(log log n) expected with high probability for it. Hence, 
in total the query time is O(log log n + t). 
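
For readers who want to see concretely what a 3-sided query reports, here is a
minimal sketch of a static priority search tree in Python. It is the textbook
structure (a heap on y combined with a search tree on x) and answers a query
in O(log n + t) time; it is not the MPST or the two-level structure described
above, and the names build_pst and query_3sided are hypothetical.

```python
def build_pst(points):
    """Build a static priority search tree from (x, y) points sorted by x."""
    if not points:
        return None
    # The point with minimum y becomes the root (heap order on y).
    i = min(range(len(points)), key=lambda j: points[j][1])
    rest = points[:i] + points[i + 1:]
    mid = len(rest) // 2
    # Left subtree holds x <= split, right subtree holds x >= split.
    split = rest[mid][0] if rest else points[i][0]
    return (points[i], split, build_pst(rest[:mid]), build_pst(rest[mid:]))


def query_3sided(node, a, b, c, out):
    """Append to out every point with a <= x <= b and y <= c."""
    if node is None:
        return out
    (x, y), split, left, right = node
    if y > c:
        return out            # all descendants have even larger y
    if a <= x <= b:
        out.append((x, y))
    if a <= split:
        query_3sided(left, a, b, c, out)
    if b >= split:
        query_3sided(right, a, b, c, out)
    return out


points = sorted([(1, 5), (2, 1), (3, 7), (4, 3), (5, 2), (6, 9)])
tree = build_pst(points)
answer = sorted(query_3sided(tree, 2, 5, 3, []))
```

The pruning on y is what makes the reporting cost proportional to the output
size: a subtree is abandoned as soon as its root lies above the line y = c.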

Theorem 9. There exists a dynamic main memory data structure that supports 
3-sided queries in O(log log n + t) time expected w.h.p., can be updated in 
O(log log n) time expected w.h.p. and consumes linear space, under the assumption 
that the x-coordinates are continuously drawn from a μ-random distribution and 
the y-coordinates are drawn from the restricted class of distributions. 

In order to extend the above structure to work in external memory, we 
follow a similar scheme. We use an auxiliary EPST 
and index the leaves of the main structure with an ISB-tree. This imposes that 
the x-coordinates are drawn from an (n/(log log n)^{1+ε}, n^{1-δ})-smooth distribution, 
where ε > 0 and δ = 1 - 1/B, otherwise the search bound would not be expected 
to be doubly logarithmic. Moreover, the main structure consists of three levels, 
instead of two. That is, we divide the n elements into n' = n/log n buckets of size 
log n, which we implement as EPSTs (instead of PSTs). This will constitute 
the lower level of the whole structure. The n' representatives of the EPSTs are 
again divided into buckets of size O(B), which constitute the middle level. The 
n'' = n'/B representatives are stored in the leaves of an external MPST (EMPST), 
which constitutes the upper level of the whole structure. In total, the space of 
the aforementioned structures is O(n' + n'' + n'' log^{(k)} n'') = O(n/log n + 
n/(B log n) + (n/(B log n))·B) = O(n/log n) = O(n/B), where k is such that 
log^{(k)} n'' = O(B) holds. 
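
As a small numeric illustration of the three levels, the Python sketch below
computes the number of lower-level buckets (n elements in buckets of about
log n points) and the number of middle-level buckets (lower-level
representatives in groups of B). It assumes base-2 logarithms, a middle-level
bucket size of exactly B, and rounding up; these choices and the function
names level_sizes and iterations_until_at_most are assumptions made for the
example only. The second function counts how many times the logarithm has to
be iterated before the value drops to at most B.

```python
import math


def level_sizes(n, B):
    """Return (lower-level bucket count, middle-level bucket count) for n >= 2."""
    bucket = max(1, math.floor(math.log2(n)))  # lower-level bucket size, about log n
    n1 = math.ceil(n / bucket)                 # buckets of the lower level
    n2 = math.ceil(n1 / B)                     # buckets of the middle level
    return n1, n2


def iterations_until_at_most(value, B):
    """Smallest k such that applying log2 k times to value gives at most B."""
    k = 0
    while value > B:
        value = math.log2(value)
        k += 1
    return k


n1, n2 = level_sizes(2**20, 64)
k = iterations_until_at_most(n2, 64)
```

For n = 2^20 and B = 64 this gives 52429 lower-level buckets, 820 middle-level
buckets, and a single application of the logarithm already brings the upper
level size below B.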

The update algorithm is similar to the internal memory variant. The query 
algorithm first proceeds bottom up. We locate the appropriate structures of 
the lower level in O(log_B log n) I/Os w.h.p., due to the assumption on the 
x-coordinates. The details of this procedure in the I/O model can be found in 
[21]. Note that if we assume that the x-coordinates are drawn from the grid 
distribution with parameters [1, M], then this access step can be realized in 
O(1) I/Os. That is done by using an array A of size M as the access data 
structure. The position A[i] keeps a pointer to the leaf with x-coordinate not 
bigger than i [33]. Then, by executing the query algorithm, we locate the at 




Fig. 4. The external memory construction for the restricted distributions 



most two structures of the middle level that contain the representative leaves of 
the EPSTs we have accessed. Similarly we find the representatives of the middle 
level structures in the EMPST. Once we reach the node whose minimum 
y-coordinate is bigger than c, the algorithm continues top down. It traverses 
the EMPST and accesses the structures of the middle and the lower level that 
contain parts of the answer. The query time spent on the EMPST is O(t/B) 
I/Os. All accessed middle level structures cost O(2 + t/B) I/Os. The access 
on the lower level costs O(log_B log n + t/B) I/Os. Hence, the total query time 
becomes O(log_B log n + t/B) I/Os expected with high probability. We get that: 
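
The array-based access step mentioned above for grid-distributed x-coordinates
can be sketched in a few lines of Python. The sketch assumes integer
x-coordinates in the range 1..M, represents a "pointer to a leaf" by the index
of that leaf in a sorted list of leaf keys, and uses -1 when no leaf has a key
small enough; the function name build_access_array is hypothetical. Position 0
is allocated only to keep the indexing direct.

```python
def build_access_array(leaf_keys, M):
    """A[i] = index of the leaf with the largest key <= i, or -1 if none.

    leaf_keys must be sorted integers in the range 1..M.
    """
    A = [-1] * (M + 1)
    j = -1
    for i in range(M + 1):
        # Advance to the last leaf whose key does not exceed i.
        while j + 1 < len(leaf_keys) and leaf_keys[j + 1] <= i:
            j += 1
        A[i] = j
    return A


leaf_keys = [2, 5, 9]
A = build_access_array(leaf_keys, M=10)
leaf_for_a = A[4]   # leaf reached for a query endpoint a = 4
leaf_for_b = A[9]   # leaf reached for a query endpoint b = 9
```

Once the array is built, locating the leaf for an endpoint is a single array
lookup, which is why the access step costs a constant number of I/Os in this
setting.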

Theorem 10. There exists a dynamic external memory data structure that supports 
3-sided queries in O(log_B log n + t/B) I/Os expected w.h.p., can be updated in 
O(log_B log n) I/Os expected w.h.p. and consumes O(n/B) space, under the assumption 
that the x-coordinates are continuously drawn from a smooth distribution 
and the y-coordinates are drawn from the restricted class of distributions. 

8 Conclusions 

We considered the problem of answering three-sided range queries of the form 
[a, b] x (-∞, c] under sequences of inserts and deletes of points, trying to attain 
linear space and doubly logarithmic expected w.h.p. operation complexities, under 
assumptions on the input distributions. We proposed three solutions, which 
we modified appropriately in order to work for the RAM and the I/O model. 
All of them consist of combinations of known data structures that support the 
3-sided query operation. 

The internal variant of the first solution combines Priority Search Trees [29] 
and achieves O(log log n) expected w.h.p. update time and O(log n + t) w.c. query 
time, using linear space. Analogously, the external variant of the first solution 
combines External Priority Search Trees [5] and achieves the update operation in 
O(log_B log n) I/Os amortized expected w.h.p. and the query operation in 
O(log_B n + t/B) I/Os in the worst case, using linear space. The bounds are true 
under the assumption that the x and y-coordinates are drawn continuously from 
μ-random distributions. 

The internal variant of the second solution combines exponential weight balanced 
trees with RMQ structures and achieves O(log log n + t) expected query 
time with high probability and O(log log n) expected amortized update time. 
Analogously, the external variant of the second solution achieves the update 
operation in O(log_B log n) expected amortized I/Os and the query operation in 
O(log log_B n + t/B) expected I/Os with high probability. The main drawback 
of this solution appears in the I/O approach, where the block-size factor B 
appears in the second logarithm (O(log log_B n)). 

In order to improve the latter, we proposed a third solution with stronger 
assumptions on the coordinate distributions. We restricted the y-coordinates to 
be continuously drawn from a restricted distribution and the x-coordinates to 
be drawn from (f(n), g(n))-smooth distributions, for appropriate functions f 
and g, depending on the model. The internal variant of this solution can be 
accessed by an IS-tree [24], incorporates the Modified Priority Search Tree [22] 
and decreases the query complexity to O(log log n + t) expected w.h.p., preserving 
the update and space complexity. The external variant combines the External 
Modified Priority Search Tree, which was presented here, with External Priority 
Search Trees and is accessed by an ISB-tree [21]. The update time is O(log_B log n) 
I/Os expected w.h.p., the query time is O(log_B log n + t/B) I/Os and the space 
is linear. 

The proposed solutions are practically implementable. We therefore leave as 
future work an experimental performance evaluation, in order to confirm in practice 
the improved query performance and scalability of the proposed methods. 